Abstract

Coresets are efficient representations of data sets such that models trained on the coreset are provably competitive with models trained on the original data set. As such, they have been successfully used to scale up clustering models such as K-Means and Gaussian mixture models to massive data sets. However, until now, the algorithms and the corresponding theory were usually specific to each clustering problem.

We propose a single, practical algorithm to construct strong coresets for a large class of hard and soft clustering problems based on Bregman divergences. This class includes hard clustering with popular distortion measures such as the squared Euclidean distance, the Mahalanobis distance, the KL-divergence and the Itakura-Saito distance. The corresponding soft clustering problems are directly related to popular mixture models due to a dual relationship between Bregman divergences and exponential family distributions. Our theoretical results further imply a randomized polynomial-time approximation scheme for hard clustering. We demonstrate the practicality of the proposed algorithm in an empirical evaluation.


1 INTRODUCTION 

Clustering is the task of partitioning data points into groups (or clusters) such that similar data points are assigned to the same cluster. It is widely used in machine learning, data mining and information theory and remains an important unsupervised learning problem: according to Wu et al. (2008), Lloyd's algorithm, a local search algorithm for K-Means, is still one of the ten most popular algorithms for data mining.

Over the last several years, the world has witnessed the emergence of data sets of an unprecedented scale across different scientific disciplines. The large size of such data sets presents new computational challenges, as existing algorithms can be computationally infeasible in the context of millions or even billions of data points. Clustering is not exempt from this problem: popular algorithms to solve both hard clustering problems (such as K-Means) and soft clustering problems (such as Gaussian mixture models) typically require multiple passes through a data set.

A well-established technique for scaling several clustering problems (including K-Means and GMMs) is based on coresets, a data summarization technique originating from computational geometry. A coreset is an efficient representation of the full data set such that models trained on a coreset are provably competitive with models trained on the original data set. Since coresets are typically small and easy to construct, they allow for efficient approximate inference with strong theoretical guarantees. However, until now, the algorithms and the corresponding theory were specific to each clustering problem.

Bregman divergences are a class of dissimilarity measures that are characterized by the property that the mean of a set of points is the optimal representative of that set. As such, the Bregman hard clustering problem generalizes a variety of important clustering problems such as K-Means and various information theoretic extensions (Banerjee et al., 2005). A family of corresponding soft clustering problems is denoted by Bregman soft clustering (Banerjee et al., 2005) and is closely related to fitting mixture models with exponential family distributions. Hence, Bregman clustering offers a natural framework for studying a variety of hard and soft clustering problems.


† These authors contributed equally to this work.

Divergence

Squared Euclidean
Mahalanobis
Relative Entropy
Itakura-Saito
Harmonic ($\alpha > 0$)
Norm-like ($\alpha \ge 2$)
Exponential Loss
Hellinger

Table 1: Selection of $\mu$-similar Bregman divergences (Ackermann and Blömer, 2009).


Our contributions. In this paper, we consider the problem of scaling Bregman clustering to massive data sets using coresets. As our key contributions, we:

• provide strong coresets of size independent of the data set size for all $\mu$-similar Bregman divergences,

• prove that the same practical coreset construction works for both hard and soft clustering problems,

• provide a randomized polynomial-time approximation scheme for the corresponding hard clustering problems,

• establish the combinatorial complexity of mixture models induced by the regular exponential family,

• demonstrate the practicality of the proposed algorithm in an empirical evaluation.


2 BACKGROUND 


Bregman divergence. For any strictly convex, differentiable function $\phi : \mathcal{K} \to \mathbb{R}$, the Bregman divergence with respect to $\phi$ for all $p, q \in \mathcal{K}$ is defined as

$$d_\phi(p, q) = \phi(p) - \phi(q) - \nabla\phi(q)^T (p - q).$$

For example, setting $\mathcal{K} = \mathbb{R}^d$ and $\phi(q) = \|q\|_2^2$ results in $d_\phi(p, q) = \|p - q\|_2^2$, which is the squared Euclidean distance. Bregman divergences are characterized by the fact that the mean of a set of points minimizes the sum of Bregman divergences between these points and any other point. More formally, for any Bregman divergence $d_\phi$ and any finite set $X \subseteq \mathcal{K}$, it holds that

$$\frac{1}{|X|} \sum_{x \in X} x = \arg\min_{z \in \mathcal{K}} \sum_{x \in X} d_\phi(x, z). \quad (1)$$

Furthermore, Banerjee et al. (2005) proved that any function satisfying (1) is a Bregman divergence.
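The definition and the mean-as-minimizer property can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names (`bregman`, `total_divergence`, and the two generator functions) are illustrative choices and do not come from the paper.

```python
import math

def bregman(phi, grad_phi, p, q):
    """d_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>, vectors given as lists."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_phi(q), p, q))
    return phi(p) - phi(q) - inner

# Generator of the squared Euclidean distance: phi(q) = ||q||_2^2.
def sq_norm(q):
    return sum(v * v for v in q)

def sq_norm_grad(q):
    return [2.0 * v for v in q]

# Generator of the (generalized) relative entropy: phi(q) = sum_i q_i ln q_i, q_i > 0.
def neg_entropy(q):
    return sum(v * math.log(v) for v in q)

def neg_entropy_grad(q):
    return [math.log(v) + 1.0 for v in q]

def total_divergence(phi, grad_phi, points, z):
    """Sum of d_phi(x, z) over all points x, i.e. the objective minimized by the mean."""
    return sum(bregman(phi, grad_phi, x, z) for x in points)

points = [[1.0, 2.0], [3.0, 1.0], [2.0, 6.0]]
mean = [sum(col) / len(points) for col in zip(*points)]

for phi, grad in [(sq_norm, sq_norm_grad), (neg_entropy, neg_entropy_grad)]:
    at_mean = total_divergence(phi, grad, points, mean)
    nearby = total_divergence(phi, grad, points, [mean[0] + 0.1, mean[1]])
    print(phi.__name__, at_mean < nearby)
```

Note that the divergence is not symmetric in general: the mean is optimal when it appears as the second argument, which is the role the cluster centers play later on.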

$\mu$-similar Bregman divergences are a subclass of Bregman divergences related to the squared Mahalanobis distance.¹ The class of $\mu$-similar Bregman divergences includes many popular Bregman divergences such as the squared Euclidean distance, the squared Mahalanobis distance, the KL-divergence and the Itakura-Saito distance. Several other divergences from this class are shown in Table 1.

¹ The squared Mahalanobis distance $d_A$ for any positive definite matrix $A \in \mathbb{R}^{d \times d}$ and any $p, q \in \mathbb{R}^d$ is defined as

$$d_A(p, q) = (p - q)^T A (p - q).$$

Definition 1 ($\mu$-similar Bregman divergence)

A Bregman divergence $d_\phi$ on domain $\mathcal{K} \subseteq \mathbb{R}^d$ is $\mu$-similar for some $\mu > 0$ iff there exists a positive definite matrix $A \in \mathbb{R}^{d \times d}$ such that, for each $p, q \in \mathcal{K}$,

$$\mu \, d_A(p, q) \le d_\phi(p, q) \le d_A(p, q)$$

where $d_A$ denotes the squared Mahalanobis distance.

Coresets. A coreset is a weighted subset of the data such that the quality of any clustering evaluated on the coreset closely approximates the quality on the full data set. Consider a cost function that depends on a set of points $X$ and a query $Q \in \mathcal{Q}$ and that is additively decomposable into non-negative functions $\{f_Q(x)\}_{x \in X}$, i.e.

$$\mathrm{cost}(X, Q) = \sum_{x \in X} f_Q(x).$$

For example, for the K-Means clustering problem, $f_Q(x)$ is the squared Euclidean distance of $x$ to the closest cluster center in the set $Q$. The key idea behind coresets is to find a weighted subset $C$ such that the cost of a query $Q$ can be approximated on $C$ by

$$\mathrm{cost}(C, Q) = \sum_{(w, c) \in C} w \, f_Q(c).$$

A weighted subset $C$ is an $\varepsilon$-coreset of $X$ if it approximates the cost function of the full data set up to a multiplicative factor of $1 \pm \varepsilon$ uniformly for all queries $Q \in \mathcal{Q}$, i.e.

$$(1 - \varepsilon) \, \mathrm{cost}(X, Q) \le \mathrm{cost}(C, Q) \le (1 + \varepsilon) \, \mathrm{cost}(X, Q).$$

Since the cost contributions $f_Q(x)$ and the space of queries $\mathcal{Q}$ depend on the problem at hand, coresets are inherently problem-specific.
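To make the coreset inequality concrete, the sketch below evaluates the weighted K-Means cost and tests the two-sided bound for a single query. The helper names are hypothetical, and checking one query is of course weaker than the definition, which demands the bound for every query simultaneously.

```python
def sq_dist(x, c):
    """Squared Euclidean distance between two points given as lists."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def cost(points, weights, centers):
    """Weighted K-Means cost: sum over points of w * (distance to closest center)."""
    return sum(w * min(sq_dist(x, c) for c in centers)
               for x, w in zip(points, weights))

def satisfies_coreset_bound(X, C, C_weights, query, eps):
    """Check (1 - eps) cost(X, Q) <= cost(C, Q) <= (1 + eps) cost(X, Q) for ONE query Q."""
    full = cost(X, [1.0] * len(X), query)
    approx = cost(C, C_weights, query)
    return (1 - eps) * full <= approx <= (1 + eps) * full

# Toy data: four copies each of two locations. Keeping one copy per location
# with weight 4 reproduces the cost exactly; keeping a single point does not.
X = [[0.0, 0.0]] * 4 + [[10.0, 0.0]] * 4
good_C, good_w = [[0.0, 0.0], [10.0, 0.0]], [4.0, 4.0]
bad_C, bad_w = [[0.0, 0.0]], [8.0]

query = [[10.0, 0.0]]
print(satisfies_coreset_bound(X, good_C, good_w, query, eps=0.1))
print(satisfies_coreset_bound(X, bad_C, bad_w, query, eps=0.1))
```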
Coresets have been the subject of several recent papers, with focus on unsupervised parametric and non-parametric models (Feldman et al., 2011; Balcan et al., 2013; Bachem et al., 2015; Lucic et al., 2015), as well as in the context of empirical risk minimization (Reddi et al., 2015).

As the coreset property bounds the approximation error for all queries, the difference between the solution on the full data set and the solution on the coreset is bounded. Hence, one can use any solver on the coreset instead of the full data set and obtain provable approximation guarantees.² At the same time, coresets are usually sublinear in (or even independent of) the number of samples, which implies that one can run computationally intensive algorithms that would otherwise be infeasible.

Additionally, coresets are a practical and flexible tool that requires no assumptions on the data. While the theory behind coresets requires elaborate tools from computational geometry, the resulting coreset construction algorithms are simple to implement.

A key property of coresets is that they can be constructed both in a distributed and a streaming setting. The constructions rely on the property that both unions of coresets and coresets of coresets are coresets (Har-Peled and Mazumdar, 2004). In fact, Feldman et al. (2013) use these properties to construct coresets in a tree-wise fashion which can be parallelized in a Map-Reduce style or used to maintain an up-to-date coreset in a streaming setting by applying the static-to-dynamic transformation (Bentley and Saxe, 1980).
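The tree-wise scheme can be sketched as a binary counter over summaries: two summaries of the same level are merged (union) and reduced again (coreset of a coreset). The code below is an illustration under stated assumptions, not the construction of Feldman et al. (2013): `insert_chunk` and `current_summary` are hypothetical names, and `uniform_reduce` is a stand-in reducer that only preserves the total weight and carries no coreset guarantee. In a real pipeline the reducer would be a coreset construction, and the approximation error would have to be budgeted across tree levels.

```python
import random

_rng = random.Random(0)  # fixed seed so the sketch is reproducible

def uniform_reduce(weighted_points, m=4):
    """Placeholder reducer, NOT a coreset construction: keep at most m points
    and rescale their weights so that the total weight is preserved."""
    if len(weighted_points) <= m:
        return list(weighted_points)
    total = sum(w for _, w in weighted_points)
    kept = _rng.sample(weighted_points, m)
    scale = total / sum(w for _, w in kept)
    return [(x, w * scale) for x, w in kept]

def insert_chunk(buckets, chunk, reduce):
    """buckets[i] holds one summary of tree level i, or None.
    A new chunk is summarized and carried upwards like a binary increment."""
    summary, level = reduce(chunk), 0
    while level < len(buckets) and buckets[level] is not None:
        summary = reduce(buckets[level] + summary)  # union, then reduce again
        buckets[level] = None
        level += 1
    if level == len(buckets):
        buckets.append(None)
    buckets[level] = summary

def current_summary(buckets):
    """Union of all stored summaries; this summarizes the stream seen so far."""
    return [item for b in buckets if b is not None for item in b]

buckets = []
for start in range(0, 64, 8):  # a stream of 8 chunks with 8 unit-weight points each
    chunk = [([float(i)], 1.0) for i in range(start, start + 8)]
    insert_chunk(buckets, chunk, uniform_reduce)

summary = current_summary(buckets)
print(len(summary), sum(w for _, w in summary))
```

At any time only a logarithmic number of summaries (in the number of chunks) is stored, which is what makes the scheme suitable for streams.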

3 STRONG CORESETS FOR HARD CLUSTERING

The goal of the K-Means clustering problem is to find $k$ cluster centers $C$ such that the quantization error

$$\mathrm{cost}_{\text{kmeans}}(X, C) = \sum_{x \in X} \min_{c \in C} \|x - c\|_2^2$$

is minimized, where $X \subseteq \mathbb{R}^d$ denotes the set of data points to be clustered. By replacing the squared Euclidean distance $\|\cdot\|_2^2$ with a Bregman divergence $d_\phi$, K-Means clustering generalizes to Bregman hard clustering. In this problem the goal is to compute a set of $k$ cluster centers $C$ minimizing

$$\mathrm{cost}_H(X, C) = \sum_{x \in X} d_\phi(x, C) \quad (2)$$

where we define $d_\phi(x, C) = \min_{c \in C} d_\phi(x, c)$. For weighted data sets, the contribution of a point to the cost function is scaled by its weight $v(x)$.

² To account for weighted data, solvers can either be naturally extended (e.g. Algorithms 1 and 5), or weighted data points can be replaced by multiple copies of the same point after appropriate rescaling.

Algorithm 1 Bregman hard $k$-clustering with $d_\phi$
Require: $X$, $k$, initial centers $\{\mu_j\}_{j=1}^k$
repeat

The notion of hard clustering relates to the fact that, as in K-Means clustering, the minimum with regards to the set of cluster centers $C$ leads to a hard assignment of each data point to its closest cluster center. Furthermore, the assignment boundary between two arbitrary centers is a $(d - 1)$-dimensional hyperplane and the set of cluster centers $C$ induces a Voronoi partitioning on the data set $X$ (Banerjee et al., 2005).

Lloyd's algorithm (Lloyd, 1982) solves the K-Means problem by iterating between assigning points to closest cluster centers and recalculating the centers as the mean of the assigned points. As the mean is an optimal representative of a single cluster for any Bregman divergence, Lloyd's algorithm can be naturally generalized to the Bregman hard clustering algorithm detailed in Algorithm 1 (Banerjee et al., 2005).
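The alternation described above can be sketched for weighted data as follows. This is an illustrative implementation based on the textual description (assign to the closest center, then move each center to the mean), not a transcription of Algorithm 1; the function name, the `max_iter` cap and the handling of empty clusters are assumptions.

```python
def bregman_hard_clustering(points, weights, centers, divergence, max_iter=100):
    """Weighted Lloyd-style iteration for an arbitrary divergence d(x, c)."""
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: index of the closest center for every point.
        labels = [min(range(len(centers)), key=lambda j: divergence(x, centers[j]))
                  for x in points]
        # Update step: the weighted mean is the optimal representative
        # of a cluster for any Bregman divergence.
        new_centers = []
        for j, old in enumerate(centers):
            members = [(x, w) for x, w, l in zip(points, weights, labels) if l == j]
            total = sum(w for _, w in members)
            if total == 0:
                new_centers.append(old)  # assumption: leave empty clusters in place
                continue
            new_centers.append([sum(w * x[i] for x, w in members) / total
                                for i in range(len(old))])
        if new_centers == centers:  # assignments and centers are stable
            break
        centers = new_centers
    return centers

def sq_euclidean(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))

points = [[0.0], [1.0], [10.0], [11.0]]
print(bregman_hard_clustering(points, [1.0] * 4, [[0.0], [10.0]], sq_euclidean))
print(bregman_hard_clustering(points, [3.0, 1.0, 1.0, 1.0], [[0.0], [10.0]], sq_euclidean))
```

Passing weights explicitly is what allows the same routine to run on a coreset, where each sampled point stands in for many original points.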

Each iteration of Algorithm 1 incurs a computational cost of $\mathcal{O}(nkd)$. As the number of iterations until convergence can be very large, coresets can be used to scale Bregman hard clustering. Previous approaches either only considered weak coresets (Ackermann and Blömer, 2009) or impose prior assumptions on the data (Feldman et al., 2013). In contrast, we provide a significantly stronger theoretical guarantee via strong coresets, i.e. coresets for which the approximation guarantee holds for all possible queries.

Definition 2 (Coreset definition) Let $\varepsilon > 0$ and $k \in \mathbb{N}$. Let $d_\phi$ be a $\mu$-similar Bregman divergence on domain $\mathcal{K}$ and $X \subseteq \mathcal{K}$ be a finite set of points. The weighted set $C$ is an $(\varepsilon, k)$-coreset of $X$ for hard clustering with $d_\phi$ if for any set $Q \subseteq \mathcal{K}$ of cardinality $k$

$$|\mathrm{cost}_H(X, Q) - \mathrm{cost}_H(C, Q)| \le \varepsilon \, \mathrm{cost}_H(X, Q).$$
3.1 Coreset construction algorithm

The idea behind the proposed coreset construction is 
straightforward: The objective function for Bregman 
hard clustering in (2) is additively decomposable over 
X and the contribution of each point is independent 
of all the other points. Hence, the cost evaluated on 
a uniform subsample of the data points is an unbiased 
estimator of the true objective function. 

While this intuition is simple, unbiasedness is not sufficient and uniform subsampling does not provide the strong theoretical guarantee required by Definition 2: single points can potentially have a large impact on the objective function and force the sample size to $\Omega(n)$.³ In particular, this occurs if the clusters are imbalanced or if there are points far away from the bulk of the data. Moreover, the coreset property in Definition 2 needs to hold uniformly for all queries in the (possibly infinite) set $\mathcal{Q}$.

To obtain the strong coreset property, we propose a coreset construction based on the importance sampling framework by Langberg and Schulman (2010) and Feldman and Langberg (2011). Consider a Bregman hard clustering problem defined by a $\mu$-similar Bregman divergence $d_\phi$ on domain $\mathcal{K}$ and a finite data set $X \subseteq \mathcal{K}$. The construction consists of two steps:

Step 1. We first find a rough approximation (bicriteria approximation) of the optimal clustering. As $\mu$-similar Bregman divergences are closely related to the squared Mahalanobis distance, we show in the proof of Theorem 1 that it is sufficient to find a rough approximation with regards to the Mahalanobis distance. To this end, we apply $D^2$-sampling, which was previously analyzed by Arthur and Vassilvitskii (2007) in the context of the popular k-means++ algorithm. The idea is to sample data points as cluster centers using an adaptive sampling scheme: the first cluster center is sampled uniformly at random and additional points are then iteratively sampled with probability proportional to the minimum squared Mahalanobis distance to the already sampled cluster centers. We prove that this procedure, detailed in Algorithm 2, produces a solution that is, with constant probability, $\mathcal{O}(\log k)$ competitive with the optimal clustering in terms of Mahalanobis distance. Under natural assumptions on the data, a bicriteria approximation can even be computed in sublinear time (Bachem et al., 2016).
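The adaptive sampling scheme of Step 1 can be sketched in a few lines. The code is an illustration of the description above using only the standard library; `mahalanobis_sq` and `d2_sampling` are hypothetical names, the matrix `A` is passed as a list of rows, and the fallback for the degenerate case where all remaining distances are zero is an assumption.

```python
import random

def mahalanobis_sq(p, q, A):
    """Squared Mahalanobis distance (p - q)^T A (p - q); A is a list of rows."""
    diff = [a - b for a, b in zip(p, q)]
    return sum(diff[i] * A[i][j] * diff[j]
               for i in range(len(diff)) for j in range(len(diff)))

def d2_sampling(X, k, A, rng):
    """Sample k centers from X: the first uniformly at random, every further one
    with probability proportional to its distance to the closest chosen center."""
    B = [rng.choice(X)]
    dist = [mahalanobis_sq(x, B[0], A) for x in X]  # d_A(x, B) for the current B
    for _ in range(1, k):
        if sum(dist) == 0:  # assumption: fall back to uniform if all points are covered
            B.append(rng.choice(X))
        else:
            B.append(rng.choices(X, weights=dist, k=1)[0])
        # Only the newly added center can decrease the distance to B.
        dist = [min(d, mahalanobis_sq(x, B[-1], A)) for d, x in zip(dist, X)]
    return B

# Three well-separated locations with duplicates: points that coincide with an
# already chosen center have probability zero, so all three locations are found.
X = [[0.0, 0.0]] * 5 + [[100.0, 0.0]] * 5 + [[0.0, 100.0]] * 5
identity = [[1.0, 0.0], [0.0, 1.0]]
print(d2_sampling(X, 3, identity, random.Random(0)))
```

Maintaining the running minimum `dist` keeps the cost at one distance evaluation per point and sampled center.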

Step 2. The rough approximation is then used in Algorithm 3 to compute an importance sampling distribution. The idea is to sample points with a potentially high impact on the objective more frequently but assign them a lower weight. The sensitivity $\sigma(x)$ of a point $x \in X$ is the maximal ratio between the cost contribution of that point and the average contribution of all points (Langberg and Schulman, 2010), i.e.

$$\sigma(x) = \max_{Q \subseteq \mathcal{K} : |Q| = k} \frac{d_\phi(x, Q)}{\frac{1}{|X|} \sum_{x' \in X} d_\phi(x', Q)}.$$

We derive an upper bound for $\sigma(x)$ and use it as the sampling distribution in Algorithm 3. Intuitively, this specific choice bounds the variance of the importance sampling scheme (Feldman and Langberg, 2011) and, if we sample enough points, produces a coreset. This result is formally stated and proven in Theorem 1 where we provide a bound on the required coreset size.

³ A simple example is a data set where a first cluster contains $n - 1$ points at a single location and a second cluster consists of one point arbitrarily far away from the first cluster.

Algorithm 2 Mahalanobis $D^2$-sampling
Require: $X$, $k$, $d_A$
Uniformly sample $x \in X$ and set $B = \{x\}$.
for $i \leftarrow 2, 3, \ldots, k$
    Sample $x \in X$ with probability $\frac{d_A(x, B)}{\sum_{x' \in X} d_A(x', B)}$ and add it to $B$.
return $B$

Algorithm 3 Coreset construction
Require: $X$, $k$, $B$, $m$, $d_A$
$\alpha \leftarrow 16(\log k + 2)$
for each $b_i$ in $B$
    $B_i \leftarrow$ Set of points from $X$ closest to $b_i$ in terms of $d_A$. Ties broken arbitrarily.
$c_\phi \leftarrow \frac{1}{|X|} \sum_{x' \in X} d_A(x', B)$
for each $b_i \in B$ and $x \in B_i$
    $s(x) \leftarrow \frac{\alpha \, d_A(x, B)}{c_\phi} + \frac{2 \alpha \sum_{x' \in B_i} d_A(x', B)}{|B_i| \, c_\phi} + \frac{4 |X|}{|B_i|}$
for each $x \in X$
    $p(x) \leftarrow s(x) / \sum_{x' \in X} s(x')$
$C \leftarrow$ Sample $m$ weighted points from $X$ where each point $x$ has weight $\frac{1}{m \, p(x)}$ and is sampled with probability $p(x)$.
return $C$

Finally, we can solve the Bregman hard clustering 
problem on the coreset using a weighted version of the 
Bregman hard clustering algorithm. 
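The importance sampling step can be sketched as below. This is an illustration under explicit assumptions rather than a reference implementation: the function name `build_coreset` is hypothetical, the squared Euclidean distance stands in for $d_A$ (i.e. $A = I$), the logarithm is taken to be the natural one, and the three-term sensitivity bound is the form commonly stated for this style of construction, which should be checked against the paper before relying on it.

```python
import math
import random

def sq_dist(p, q):
    """Squared Euclidean distance, i.e. the squared Mahalanobis distance with A = I."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def build_coreset(X, B, m, rng, dist=sq_dist):
    """Return m weighted points (x, w) sampled by an upper bound on the sensitivity."""
    n, k = len(X), len(B)
    alpha = 16 * (math.log(k) + 2)  # assumption: natural logarithm
    # Closest rough center for every point and the distance to it.
    nearest = [min(range(k), key=lambda i: dist(x, B[i])) for x in X]
    d = [dist(x, B[i]) for x, i in zip(X, nearest)]
    c_phi = sum(d) / n or 1.0  # average cost; guard against an all-zero cost
    size, cluster_cost = [0] * k, [0.0] * k
    for i, di in zip(nearest, d):
        size[i] += 1
        cluster_cost[i] += di
    # Assumed sensitivity bound: own cost + average cost of the cluster + cluster size term.
    s = [alpha * di / c_phi
         + 2 * alpha * cluster_cost[i] / (size[i] * c_phi)
         + 4 * n / size[i]
         for i, di in zip(nearest, d)]
    total = sum(s)
    p = [si / total for si in s]
    idx = rng.choices(range(n), weights=p, k=m)  # m independent draws with replacement
    # Weight 1 / (m * p(x)) makes the weighted cost an unbiased estimate of the full cost.
    return [(X[j], 1.0 / (m * p[j])) for j in idx]

# Fifty points near the origin and one far outlier; the outlier forms its own cluster.
X = [[i / 100] for i in range(50)] + [[1000.0]]
B = [[0.0], [1000.0]]
C = build_coreset(X, B, 20, random.Random(1))
print(len(C), sum(w for _, w in C))
```

The last term, $4|X|/|B_i|$, is what protects small clusters: the single outlier receives a far higher sampling probability than under uniform subsampling.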

3.2 Analysis 

Our main result is that this construction leads to valid 
coresets for the Bregman hard clustering problem. In 
particular, the required coreset size does not depend 
on the size n of the original data set. 

Theorem 1 Let $\varepsilon \in (0, 1/4)$, $\delta > 0$ and $k \in \mathbb{N}$. Let $d_\phi$ be a $\mu$-similar Bregman divergence on domain $\mathcal{K}$ and denote by $d_A$ the corresponding squared Mahalanobis distance. Let $X$ be a set of points in $\mathcal{K}$ and let $B \subseteq X$ be the set with the smallest quantization error in terms of $d_A$ among $\mathcal{O}(\log \frac{1}{\delta})$ runs of Algorithm 2. Let $C$ be the output of Algorithm 3 with

$$m = \mathcal{O}\left(\frac{d k^3 + k^2 \log \frac{1}{\delta}}{\mu^2 \varepsilon^2}\right).$$

Then, with probability at least $1 - \delta$, the set $C$ is an $(\varepsilon, k)$-coreset of $X$ for hard clustering with $d_\phi$.


The proof is provided in Section C of the supplementary material. The main steps include bounding the sensitivity in Lemma 2 and bounding the combinatorial complexity of Bregman hard clustering in Theorem 6. In practice, it is usually sufficient to run Algorithm 2 only once and to fix the coreset size to $m$ instead of $\varepsilon$ (see Section 5).

As an immediate consequence of the coreset property, the optimal clustering obtained on the coreset is provably competitive with the optimal clustering on the full data set when evaluated on the full data set.


Lemma 1 Let $\varepsilon \in (0, 1)$ and let $d_\phi$ be a $\mu$-similar Bregman divergence on domain $\mathcal{K}$. Let $X \subseteq \mathcal{K}$ be a data set, $k \in \mathbb{N}$ and $C$ be an $(\varepsilon/3, k)$-coreset of $X$ for hard clustering with $d_\phi$. Let $Q^*_X$ and $Q^*_C$ denote the optimal set of cluster centers for $X$ and $C$ respectively. Then,

$$\mathrm{cost}_H(X, Q^*_C) \le (1 + \varepsilon) \, \mathrm{cost}_H(X, Q^*_X).$$


The proof is presented in Section C of the supplementary material.
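The argument is short enough to sketch. The chain below is a reconstruction from the coreset definition, not a copy of the proof in the supplementary material. Write $\varepsilon' = \varepsilon/3$ and let $Q^*_X$ and $Q^*_C$ be the optimal centers on $X$ and on the coreset $C$:

```latex
\begin{align*}
\mathrm{cost}_H(X, Q^*_C)
  &\le \frac{1}{1-\varepsilon'}\,\mathrm{cost}_H(C, Q^*_C)
     && \text{(coreset property applied to } Q^*_C\text{)} \\
  &\le \frac{1}{1-\varepsilon'}\,\mathrm{cost}_H(C, Q^*_X)
     && \text{(optimality of } Q^*_C \text{ on } C\text{)} \\
  &\le \frac{1+\varepsilon'}{1-\varepsilon'}\,\mathrm{cost}_H(X, Q^*_X)
     && \text{(coreset property applied to } Q^*_X\text{)} \\
  &\le (1+\varepsilon)\,\mathrm{cost}_H(X, Q^*_X).
\end{align*}
```

The last step uses that $(1 + \varepsilon/3)/(1 - \varepsilon/3) \le 1 + \varepsilon$ is equivalent to $\varepsilon^2 \le \varepsilon$, which holds for every $\varepsilon \in (0, 1)$. Note that the coreset property is needed for two different queries, which is why a guarantee that holds uniformly over all queries is convenient here.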

3.3 Randomized polynomial-time approximation scheme

The fact that the size of the proposed coresets is independent of the number of data points $n$ readily implies a randomized polynomial-time approximation scheme (PTAS) for Bregman hard clustering with $\mu$-similar Bregman divergences. We first generate a coreset using Algorithm 3 and then consider all possible $k$-partitionings of the coreset points. By the coreset property, it is guaranteed that the centers of the best partitioning are $1 + \varepsilon$ competitive with the optimal solution.

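Theorem 2 Let $\varepsilon \in (0, 3/4)$, $\delta > 0$ and let $d_\phi$ be a $\mu$-similar Bregman divergence on domain $\mathcal{K}$. Let $X \subseteq \mathcal{K}$ be a data set, $k \in \mathbb{N}$ and $\varepsilon$ the desired approximation error. Let $Q^*$ be the best solution from $\mathcal{O}(\log \frac{1}{\delta})$ runs of Algorithm 4. Then, with probability at least $1 - \delta$,

$$\mathrm{cost}_H(X, Q^*) \le (1 + \varepsilon) \min_Q \mathrm{cost}_H(X, Q).$$

The time complexity is $\mathcal{O}\left(\left(nkd + 2^{\mathrm{poly}(kd/(\mu\varepsilon))}\right) \log \frac{1}{\delta}\right)$.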


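Algorithm 4 Randomized PTAS
Require: $X$, $k$, $\varepsilon$, $d_\phi$
$C \leftarrow$ $(\varepsilon/3, k)$-coreset for $X$ with respect to $d_\phi$.
$\mathcal{P} \leftarrow$ Centers of all $k$-partitionings of $C$.
$Q^* \leftarrow \arg\min_{P \in \mathcal{P}} \sum_{(w, c) \in C} w \, d_\phi(c, P)$
return $Q^*$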


The correctness follows from Lemma 1 and the fact that the number of $k$-partitionings of the coreset points is independent of $n$. As this algorithm is primarily of theoretical interest, we recommend running the approach presented in Section 3.1 to solve Bregman clustering problems in practice.
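The exhaustive search over partitionings can be illustrated on a tiny weighted set. The sketch below is only meant to show why the running time depends on the coreset size and not on $n$; the function name `best_partition_centers` is hypothetical, the squared Euclidean distance is used as the divergence, and enumerating all label vectors (including relabelings of the same partition) is a deliberate simplification.

```python
from itertools import product

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def best_partition_centers(coreset, k, divergence=sq_dist):
    """Try every assignment of coreset points to k clusters, use the weighted
    cluster means as centers and keep the cheapest solution on the coreset."""
    points = [x for x, _ in coreset]
    weights = [w for _, w in coreset]
    dim = len(points[0])
    best_cost, best_centers = float("inf"), None
    for labels in product(range(k), repeat=len(points)):  # k ** len(coreset) candidates
        centers = []
        for j in range(k):
            members = [(x, w) for x, w, l in zip(points, weights, labels) if l == j]
            total = sum(w for _, w in members)
            if total == 0:
                continue  # cluster index j is unused in this labeling
            centers.append([sum(w * x[i] for x, w in members) / total
                            for i in range(dim)])
        cost = sum(w * min(divergence(x, c) for c in centers)
                   for x, w in zip(points, weights))
        if cost < best_cost:
            best_cost, best_centers = cost, centers
    return best_centers, best_cost

coreset = [([0.0], 1.0), ([1.0], 1.0), ([10.0], 2.0)]
centers, cost = best_partition_centers(coreset, 2)
print(sorted(centers), cost)
```

Even for a coreset of a few dozen points the number of candidates is astronomically large, which matches the remark that the scheme is primarily of theoretical interest.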

4 STRONG CORESETS FOR SOFT CLUSTERING

In hard clustering each data point is assigned to exactly one cluster. In contrast, in soft clustering each data point is assigned to each cluster with a certain probability. A prototypical example of soft clustering is fitting the parameters of a Gaussian mixture model in which one assumes that all data points are generated from a mixture of a finite number of Gaussians with unknown parameters. Other popular models include the Poisson mixture model, the mixture of multinomials and the mixture of exponentials.

Banerjee et al. (2005) show that there is a bijection between regular exponential family distributions and Bregman divergences. In particular, the log-likelihood of exponential family mixture models can be expressed in terms of Bregman divergences (see Section 4.2). By considering the resulting objective, one obtains Bregman soft clustering. The intuition is that Bregman hard clustering can be turned into a soft clustering problem by replacing the min function by a soft-min function. More formally, let $d_\phi$ be a Bregman divergence and $X \subseteq \mathcal{K}$ be a set of $n$ points. Let $k \in \mathbb{N}$, $w = (w_1, \ldots, w_k) \in \mathbb{R}^k$ and $\theta = (\theta_1, \ldots, \theta_k)$ with $\theta_j \in \mathcal{K}$, and let $Q$ be the concatenation of $w$ and $\theta$. The goal of Bregman soft clustering is to minimize

$$\mathrm{cost}_S(X, Q) = -\sum_{i=1}^n \ln\left(\sum_{j=1}^k w_j \exp\left(-d_\phi(x_i, \theta_j)\right)\right)$$

with respect to $Q$ under the constraint that $w_j > 0$, $1 \le j \le k$ and $\sum_{j=1}^k w_j = 1$. Similar to Bregman hard clustering, the soft clustering problem can be solved using an expectation-maximization algorithm (Banerjee et al., 2005) which is detailed in Algorithm 5. The main difference with respect to hard clustering is that a probability distribution over assignments of points to clusters is maintained.
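Algorithm 5 Bregman soft clustering with $d_\phi$
Require: $X$, $k$, initial parameters $\{(\theta_j, w_j)\}_{j=1}^k$
repeat
    for $i = 1$ to $n$
        for $j = 1$ to $k$
            $\eta_{ij} \leftarrow \frac{w_j \exp(-d_\phi(x_i, \theta_j))}{\sum_{\ell=1}^k w_\ell \exp(-d_\phi(x_i, \theta_\ell))}$
    for $j = 1$ to $k$
        $w_j \leftarrow \frac{1}{n} \sum_{i=1}^n \eta_{ij}$
        $\theta_j \leftarrow \frac{\sum_{i=1}^n \eta_{ij} x_i}{\sum_{i=1}^n \eta_{ij}}$
until convergence
return $\{(\theta_j, w_j)\}_{j=1}^k$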
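The expectation-maximization iteration extends to weighted data by scaling each point's contribution with its weight. The sketch below is an illustration under assumptions: the function names, the fixed iteration count and the weighted form of the updates are choices made here, and the exponentials are evaluated directly, so a log-sum-exp formulation would be needed for data with large divergences.

```python
import math

def sq_euclidean(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))

def soft_cost(points, weights, mix, thetas, divergence):
    """Weighted soft clustering cost: -sum_i v_i ln(sum_j w_j exp(-d(x_i, theta_j)))."""
    return -sum(v * math.log(sum(w * math.exp(-divergence(x, t))
                                 for w, t in zip(mix, thetas)))
                for x, v in zip(points, weights))

def bregman_soft_clustering(points, weights, mix, thetas, divergence, iters=50):
    k, dim = len(thetas), len(points[0])
    for _ in range(iters):
        # E-step: responsibilities proportional to w_j * exp(-d(x_i, theta_j)).
        eta = []
        for x in points:
            row = [w * math.exp(-divergence(x, t)) for w, t in zip(mix, thetas)]
            z = sum(row)
            eta.append([r / z for r in row])
        # M-step: mixture weights and responsibility-weighted means.
        total = sum(weights)
        mass = [sum(v * e[j] for v, e in zip(weights, eta)) for j in range(k)]
        mix = [m / total for m in mass]
        thetas = [[sum(v * e[j] * x[i] for x, v, e in zip(points, weights, eta)) / mass[j]
                   for i in range(dim)] for j in range(k)]
    return mix, thetas

points = [[0.0], [0.2], [3.0], [3.2]]
weights = [1.0, 2.0, 1.0, 1.0]
mix0, thetas0 = [0.5, 0.5], [[0.5], [2.5]]
mix1, thetas1 = bregman_soft_clustering(points, weights, mix0, thetas0, sq_euclidean)
print(soft_cost(points, weights, mix0, thetas0, sq_euclidean),
      soft_cost(points, weights, mix1, thetas1, sq_euclidean))
```

With unit weights the updates reduce to the unweighted iteration, and each iteration does not increase the cost, as is standard for expectation-maximization.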


4.1 Coresets for Bregman soft clustering 

As in the hard clustering case, we define the coreset 
property in terms of a cost function with respect to 
some set of queries. In the case of soft clustering, the 
queries are the parameters of the mixture model. 

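Definition 3 Let $\varepsilon > 0$ and $k \in \mathbb{N}$. Let $d_\phi$ be a $\mu$-similar Bregman divergence on domain $\mathcal{K}$ and $X \subseteq \mathcal{K}$ be a set of points. The weighted set $C$ is an $(\varepsilon, k)$-coreset of $X$ for soft clustering with $d_\phi$ if for any set $Q \subseteq \mathcal{K}$ of cardinality $k$

$$|\mathrm{cost}_S(X, Q) - \mathrm{cost}_S(C, Q)| \le \varepsilon \, \mathrm{cost}_S(X, Q).$$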


We prove that the same coreset construction provided in Section 3.1 for hard clustering also computes valid coresets for the soft clustering problem. However, due to the higher combinatorial complexity of the underlying function space, more points need to be sampled to guarantee that the resulting weighted set is a coreset.

Theorem 3 Let $\varepsilon \in (0, 1/4)$, $\delta > 0$ and $k \in \mathbb{N}$. Let $d_\phi$ be a $\mu$-similar Bregman divergence on domain $\mathcal{K}$ and denote by $d_A$ the corresponding squared Mahalanobis distance. Let $X$ be a set of points in $\mathcal{K}$ and let $B \subseteq X$ be the set with the smallest quantization error in terms of $d_A$ among $\mathcal{O}(\log \frac{1}{\delta})$ runs of Algorithm 2. Let $C$ be the output of Algorithm 3 with coreset size

$$m = \mathcal{O}\left(\frac{d^2 k^4 + k^2 \log \frac{1}{\delta}}{\mu^2 \varepsilon^2}\right).$$

Then, with probability at least $1 - \delta$, the set $C$ is an $(\varepsilon, k)$-coreset of $X$ for soft clustering with $d_\phi$.

The proof, which includes a bound on the combinatorial complexity of mixtures of regular exponential family distributions, is provided in Section D of the supplementary material.


4.2 Estimation for exponential family mixtures

Finding the maximum likelihood estimate of a set of parameters for a single exponential family can be done efficiently through the use of sufficient statistics. However, fitting the parameters of a mixture is a notoriously hard task: the mixture is not in the exponential family. More formally, consider a set $X$ of $n$ points drawn independently from a stochastic source that is a mixture of $k$ densities of the same exponential family. Given $X$, we would like to estimate the parameters $\{w_j, \theta_j\}_{j=1}^k$ of the mixture model using maximum likelihood estimation, i.e.

$$\max_{\{w_j, \theta_j\}_{j=1}^k} \mathcal{L}(X \mid \theta) = \sum_{i=1}^n \ln \sum_{j=1}^k w_j \, p_\psi(x_i \mid \theta_j). \quad (3)$$

As shown by Banerjee et al. (2005), there is a bijection between regular exponential families and regular Bregman divergences that allows us to rewrite (3) as

$$\mathcal{L}(X \mid Q) = \sum_{i=1}^n \ln\left(\sum_{j=1}^k w_j \exp\left(-d_\phi(\hat{x}_i, \eta_j)\right) b_\phi(\hat{x}_i)\right) = \sum_{i=1}^n \ln b_\phi(\hat{x}_i) + \sum_{i=1}^n \ln\left(\sum_{j=1}^k w_j \exp\left(-d_\phi(\hat{x}_i, \eta_j)\right)\right)$$

where $d_\phi$ is the corresponding Bregman divergence and both $\hat{x}$ and $\eta_j$ are related to $x$ and $\theta_j$ by Legendre duality via $\phi$. Since the first summand is independent of the model parameters, maximizing the likelihood of the mixture model is equivalent to minimizing

$$\mathrm{cost}_S(X, Q) = -\sum_{i=1}^n \ln\left(\sum_{j=1}^k w_j \exp\left(-d_\phi(x_i, \theta_j)\right)\right).$$

This is precisely the cost function of the coreset defined in Section 4.1. Hence, our coreset construction can be used for maximum likelihood estimation of regular exponential family mixtures by first constructing a coreset and then applying Bregman soft clustering (Algorithm 5) on the coreset. Moreover, as $\varepsilon \to 0$,

$$|\mathcal{L}(X \mid Q) - \mathcal{L}(C \mid Q)| = |\mathrm{cost}_S(X, Q) - \mathrm{cost}_S(C, Q)| \to 0$$

uniformly over $Q \in \mathcal{Q}$. As a byproduct of our theoretical analysis, we also provide a bound of $\mathcal{O}(k^4 d^2)$ on the combinatorial complexity of mixtures of regular exponential family distributions with $k$ components and $d$ dimensions (see Theorem 7 in Section D of the supplementary material).
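The correspondence between exponential families and Bregman divergences can be verified numerically for a single one-dimensional Poisson component, whose associated divergence is the relative entropy. The sketch below checks that the log-probability splits into a divergence term and a term that does not depend on the parameter; the function names are illustrative, and the explicit form of the parameter-free term is derived here for this one family rather than taken from the paper.

```python
import math

def poisson_log_pmf(x, lam):
    """ln Poi(x; lam) = x ln(lam) - lam - ln(x!)."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)

def relative_entropy(x, mu):
    """Bregman divergence generated by phi(t) = t ln t - t, for x > 0:
    d(x, mu) = x ln(x / mu) - x + mu."""
    return x * math.log(x / mu) - x + mu

def log_b(x):
    """Parameter-free term: ln b(x) = x ln x - x - ln(x!)."""
    return x * math.log(x) - x - math.lgamma(x + 1)

# ln p(x | lam) = -d(x, lam) + ln b(x) for every count x >= 1 and rate lam > 0.
for x in range(1, 6):
    for lam in (0.5, 2.0, 7.0):
        lhs = poisson_log_pmf(x, lam)
        rhs = -relative_entropy(x, lam) + log_b(x)
        assert abs(lhs - rhs) < 1e-9
print("decomposition verified")
```

Because `log_b` does not involve the rate, it plays the role of the first summand in the rewritten likelihood and drops out of the optimization.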
Figure 1: Relative error in relation to subsample size. For a fixed subsample size, coresets strongly outperform uniform subsampling. Shaded areas denote confidence intervals based on 500 independent trials.


5 EXPERIMENTAL EVALUATION 

In this section, we demonstrate that our proposed coreset construction can be used in practice to speed up both hard and soft clustering problems. We compare our proposed coreset construction⁴ to solving the clustering problem on both uniform subsamples of the data set and the original data set itself.

We first generate a weighted subsample of the data using either uniform subsampling or the proposed coreset construction. To solve the corresponding hard (soft) clustering problem on the subsample we apply Algorithm 1 (5) with adaptive seeding using $D^2$-sampling. We then evaluate the computed solution on the full data set to obtain the cost $C_{ss}$ and measure the CPU time elapsed for both the subsampling and the solving step. We average the results over $r = 500$ independent trials. Independently, we also measure the CPU time elapsed and the solution quality $C_{full}$ obtained when training using the full data set, again averaged across $r = 500$ independent trials. Finally, we calculate the relative error $\eta = (C_{ss} - C_{full}) / C_{full}$ for both uniform subsampling and coresets.

5.1 Data sets and parameters 

Following Banerjee et al. (2005) we use (regular) exponential family mixture models to generate two synthetic data sets. We sample the model parameters from the associated conjugate prior and cluster the data using the corresponding dual Bregman divergence (as detailed in Section 4.2).

4 For the proposed coreset construction, we use A = I 
for all data sets except for KDD where we use the inverse 
of the covariance matrix. 


GAUSSIAN. This data set consists of 10,000 points which are drawn from a mixture of $k = 50$ isotropic Gaussians in $d = 10$ dimensions. We use a $k$-dimensional Dirichlet distribution with concentration parameter $\alpha = 0.5$ to sample the mixture weights. The means of each of the $k$ components are in turn sampled from a Gaussian with zero mean and variance of 5000. We solve both hard and soft clustering problems with the squared Euclidean distance as the Bregman divergence and $k = 50$ cluster centers.

POISSON. This data set consists of 10,000 points drawn from a mixture of $k = 50$ multivariate Poisson distributions in $d = 10$ dimensions. In a given component $j$, each dimension is independently sampled, i.e. $x_i \sim \mathrm{Poi}(\lambda_{ij})$ for each $i = 1, 2, \ldots, d$. For each component $j$ and dimension $i$, the parameter $\lambda_{ij}$ is sampled from a Gamma distribution with shape $\alpha = 10$ and rate $\beta = 10^{-3}$. We consider $k = 50$ cluster centers and use the relative entropy as the Bregman divergence.

CSN. In the Community Seismic Network (Faulkner et al., 2011) more than 7GB of cellphone accelerometer data were gathered and used to detect earthquakes. The data consists of 80,000 observations and 17 extracted features. We consider $k = 50$ cluster centers and use the squared Mahalanobis distance where $A$ is the inverse of the covariance matrix.

KDD. This data set was used in the Protein Homology Prediction KDD competition and contains 145,751 training examples with 74 features that measure the match between a protein and a native sequence. We consider $k = 50$ cluster centers and use the squared Euclidean distance as the Bregman divergence.
Figure 2: Time required to reach a fixed relative error using coresets and uniform subsampling. In all settings, coresets outperform uniform subsampling. Shaded areas denote confidence intervals based on 500 trials.


5.2 Discussion 

Figure 1 shows the relative error $\eta$ for different subsample sizes. For all data sets and both hard and soft clustering, the relative error decreases as we increase the number of points in the subsample, and even for relatively small subsample sizes our coreset construction produces competitive solutions compared to solving on the full data set. Across all data sets, the proposed coreset construction outperforms uniform subsampling and achieves a lower relative error for a given subsample size. Figure 2 shows that coresets reach a given relative error faster than uniform subsamples (even if we account for the time required to construct the coreset). The practical relevance can be seen in Table 2. For KDD and a subsample size of $s = 3000$, we obtain a speedup of 81.3x using coresets, while only incurring a relative error of 4.1%. At the same time, uniform subsampling leads to a relative error of 47.4%.

Table 2: Hard clustering on KDD (3000 subsamples)

                     Uniform    Coreset    Full
Time (s)             2.11       2.17       176.84
Speedup              83.9x      81.3x      1.0x
Cost (10^9)          290.46     205.18     197.02
Relative error η     47.4%      4.1%       0.0%
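The relative errors reported in Table 2 follow directly from the cost row and the definition $\eta = (C_{ss} - C_{full}) / C_{full}$. The short check below recomputes them; the variable names are illustrative and the input values are the costs from Table 2 in units of $10^9$.

```python
def relative_error(cost_subsample, cost_full):
    """eta = (C_ss - C_full) / C_full."""
    return (cost_subsample - cost_full) / cost_full

# Costs from Table 2 (hard clustering on KDD, 3000 subsamples), in units of 10^9.
cost_uniform, cost_coreset, cost_full = 290.46, 205.18, 197.02

eta_uniform = 100 * relative_error(cost_uniform, cost_full)
eta_coreset = 100 * relative_error(cost_coreset, cost_full)
print(f"uniform: {eta_uniform:.1f}%  coreset: {eta_coreset:.1f}%")
```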

6 OTHER RELATED WORK 

Relatively few approaches were suggested for scalable clustering with Bregman divergences. Ackermann and Blömer (2009) construct weak coresets for approximate clustering with $\mu$-similar Bregman divergences and obtain weak coresets of size
0(p-fclog(n) log(/c|r| fc log?/)) where |F| < n d ' p °iy( k /e)_ 


For weak coresets, the approximation guarantee holds only for queries close to the optimal solution. As such, their applicability is severely limited. In contrast, we provide strong coresets for all $\mu$-similar Bregman divergences and generalize the results to the soft clustering case. Feldman et al. (2013) provide a coreset construction (albeit, the coreset is a set of clustering features, rather than a weighted subset of the original data set) for hard clustering with $\mu$-similar Bregman divergences with the additional restriction on the convex set $S$ (domain of $\phi$): for every pair $p, q$ of points from $P$, $S$ must contain all points within a ball of radius $(4/(\mu\varepsilon)) \, d(p, q)$ around $p$ for a constant $\mu$. In contrast, our approach makes no such assumptions and provides a strong coreset with improved dependency on $\varepsilon$. Feldman et al. (2011) provide a coreset construction for the specific case of a mixture of semi-spherical Gaussians (bounded eigenvalues) and obtain a coreset of size independent of the data set size.

7 CONCLUSION 

We propose a single coreset construction algorithm for both hard and soft clustering with any $\mu$-similar Bregman divergence. As a separate result, we establish the combinatorial complexity of mixture models induced by the regular exponential family. Experimental results demonstrate the practical utility of the proposed approach. In particular, the coresets outperform uniform subsampling and enjoy speedups of several orders of magnitude compared to solving on the full data set while retaining guarantees on the approximation error.
A BACKGROUND


Definition 4 (Sensitivity) Let $X$ be a finite set of cardinality $n$ and denote by $f_x(Q)$ a cost function from $\mathcal{Q} \times X$ to $[0, \infty)$. Define $\bar{f}_Q = \frac{1}{n} \sum_{x \in X} f_x(Q)$ for all $Q \in \mathcal{Q}$. The sensitivity of the point $x \in X$ with respect to a family of queries $\mathcal{Q}$ is defined as

$$\sigma_{\mathcal{Q}}(x) = \max_{Q \in \mathcal{Q}} \frac{f_x(Q)}{\bar{f}_Q}.$$

Definition 5 (Feldman and Langberg (2011))

Let $X$ be a finite set of cardinality $n$ and denote by $f_x(Q)$ a cost function from $\mathcal{Q} \times X$ to $[0, \infty)$. Define the set of functions $F = \{f_x(Q) \mid x \in X\}$ from the set $\mathcal{Q}$ to $[0, \infty)$. The dimension $\dim(F)$ of $F$ is the minimum integer $d$ such that

$$\forall S \subseteq F : \; |\{S \cap R \mid R \in \mathrm{ranges}(F)\}| \le (|S| + 1)^d$$

where $\mathrm{ranges}(F) = \{\mathrm{range}(Q, r) \mid Q \in \mathcal{Q}, r \ge 0\}$ and $\mathrm{range}(Q, r) = \{f \in F \mid f(Q) \le r\}$ for every $Q \in \mathcal{Q}$ and $r \ge 0$.

Theorem 4 (Feldman and Langberg (2011))

Let $\epsilon \in (0, 1/4)$. Let $X$ be a finite set of cardinality $n$ and denote by $f_x(Q)$ a cost function from $\mathcal{Q} \times X$ to $[0, \infty)$. Define the set of functions $F = \{f_x(Q) \mid x \in X\}$ from the set $\mathcal{Q}$ to $[0, \infty)$. Let $s : X \to \mathbb{N} \setminus \{0\}$ be a function such that

$$s(x) \ge \sigma_{\mathcal{Q}}(x), \quad \forall x \in X$$

and let $S = \frac{1}{n} \sum_{x \in X} s(x)$. For each $x \in X$, let $g_x : \mathcal{Q} \to [0, \infty)$ be defined as $g_x(Q) = f_x(Q)/s(x)$. Let $G_x$ consist of $s(x)$ copies of $g_x$, and let $C$ be a random sample of

+ dim(F)5 2 

functions from the set $G = \bigcup_{x \in X} G_x$. Then for every $Q \in \mathcal{Q}$


x£X c€lC X^lX 


B ANALYSIS OF ALGORITHM 2 

Definition 6 Let $X \subset \mathbb{R}^d$ be a finite data set. A set $B \subset \mathbb{R}^d$, $|B| = \beta$, is an $(\alpha, \beta)$-bicriteria solution with respect to the optimal $k$-clustering with squared Mahalanobis distance $d_A$ iff

$$\sum_{x \in X} d_A(x, B) \le \alpha \min_{C \subset \mathbb{R}^d,\, |C| = k} \sum_{x \in X} d_A(x, C).$$


Theorem 5 Let $k \in \mathbb{N}$. Let $X$ be a finite set of points in $\mathbb{R}^d$ and $d_A$ be a squared Mahalanobis distance. Denote by $B$ the output of Algorithm 2. Then, with probability at least $1/2$, $B$ is an $(\alpha, \beta)$-bicriteria solution with

$$\alpha = 16(\log_2 k + 2)$$

and $\beta = k$. The probability can be boosted to $1 - \delta$ by running the algorithm $\log \frac{1}{\delta}$ times and selecting the solution with the lowest cost in terms of squared Mahalanobis distance. The time complexity of Algorithm 2 is $O(nkd)$.

Proof Since $d_A$ is a squared Mahalanobis distance, $A$ is symmetric and positive definite. Hence, the Cholesky decomposition $A = U^T U$ is unique, where $U$ is an upper triangular matrix. For all $p, q \in \mathbb{R}^d$,

$$(p - q)^T A (p - q) = \|Up - Uq\|_2^2.$$

As a result, the map $p \mapsto Up$ in $\mathbb{R}^d$ is an isometry (distance preserving map) with regard to the metric spaces $(\mathbb{R}^d, \sqrt{d_A})$ and $(\mathbb{R}^d, \sqrt{d_{\|\cdot\|_2^2}})$. Furthermore, the isometry is bijective since $U$ is invertible.
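The identity between the squared Mahalanobis distance and the squared Euclidean distance after the map $p \mapsto Up$ can be checked numerically. The following is a minimal sketch; the random matrix `A`, the dimension and the seed are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Illustrative symmetric positive definite matrix A (an assumption).
M = rng.normal(size=(d, d))
A = M @ M.T + d * np.eye(d)

# numpy returns the lower-triangular factor L with A = L L^T,
# so the upper-triangular factor in A = U^T U is U = L^T.
U = np.linalg.cholesky(A).T

p, q = rng.normal(size=d), rng.normal(size=d)

mahalanobis_sq = (p - q) @ A @ (p - q)        # (p - q)^T A (p - q)
euclidean_sq = np.sum((U @ p - U @ q) ** 2)   # ||Up - Uq||_2^2

print(mahalanobis_sq, euclidean_sq)
```

Both printed values agree up to floating-point error, which is the isometry used in the rest of the proof.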

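As a direct consequence of the isometry, Algorithm 2 is equivalent to running $D^2$-sampling in the transformed space with the squared Euclidean distance $d_{\|\cdot\|_2^2}$. Consider the transformed data set $\tilde{X} = \{Ux \mid x \in X\}$ and the transformed solution $\tilde{B} = \{Ub \mid b \in B\}$. By Theorem 3.1 of Arthur and Vassilvitskii (2007), we have that

$$\mathbb{E}\left[\sum_{\tilde{x} \in \tilde{X}} d_{\|\cdot\|_2^2}(\tilde{x}, \tilde{B})\right] \le 8(\log_2 k + 2) \min_{C \subset \mathbb{R}^d,\, |C| = k} \sum_{\tilde{x} \in \tilde{X}} d_{\|\cdot\|_2^2}(\tilde{x}, C).$$

Markov's inequality implies that with probability at least $1/2$

$$\sum_{\tilde{x} \in \tilde{X}} d_{\|\cdot\|_2^2}(\tilde{x}, \tilde{B}) \le 16(\log_2 k + 2) \min_{C \subset \mathbb{R}^d,\, |C| = k} \sum_{\tilde{x} \in \tilde{X}} d_{\|\cdot\|_2^2}(\tilde{x}, C).$$

Due to the global isometry defined above, this implies

$$\sum_{x \in X} d_A(x, B) \le 16(\log_2 k + 2) \min_{C \subset \mathbb{R}^d,\, |C| = k} \sum_{x \in X} d_A(x, C).$$

By construction, $|B| = k$, which concludes the proof. ■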
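The equivalence used in the proof suggests one way to implement the seeding step: transform the data by the Cholesky factor and run $D^2$-sampling with the squared Euclidean distance. The sketch below is an assumption about how this could look, not a transcription of Algorithm 2; the function name `d2_sampling_mahalanobis`, the test data and the matrix `A` are hypothetical, and the data points are assumed to be distinct.

```python
import numpy as np

def d2_sampling_mahalanobis(X, A, k, rng):
    """Pick k centers from the rows of X by D^2-sampling under d_A."""
    U = np.linalg.cholesky(A).T          # A = U^T U
    Z = X @ U.T                          # rows of Z are U x
    n = Z.shape[0]
    idx = [rng.integers(n)]              # first center: uniform at random
    dist = np.sum((Z - Z[idx[0]]) ** 2, axis=1)
    for _ in range(k - 1):
        probs = dist / dist.sum()        # sample proportionally to d_A(x, B)
        j = rng.choice(n, p=probs)
        idx.append(j)
        dist = np.minimum(dist, np.sum((Z - Z[j]) ** 2, axis=1))
    return X[idx], dist.sum()            # centers and sum_x d_A(x, B)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
A = np.diag([1.0, 2.0, 0.5])
B, cost = d2_sampling_mahalanobis(X, A, k=5, rng=rng)
print(B.shape, cost)
```

Each of the $k$ rounds touches every point once, which matches the $O(nkd)$ running time stated in Theorem 5 once the one-time transformation of the data is accounted for separately.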
C HARD CLUSTERING


C.1 Sensitivity


Lemma 2 Let $d_\phi$ be a $\mu$-similar Bregman divergence on domain $\mathcal{K}$ and denote by $d_A$ the corresponding squared Mahalanobis distance. Let $X$ be a set of points in $\mathcal{K}$ and let $B \subset \mathcal{K}$ be an $(\alpha, \beta)$-bicriteria solution with respect to the optimal $k$-clustering with squared Mahalanobis distance $d_A$. For each point $x \in X$, denote by $b_x$ the closest cluster center in $B$ in terms of $d_A$ and by $X_x$ the set of all points $x' \in X$ such that $b_x = b_{x'}$. Then, the sensitivity $\sigma_{\mathcal{Q}}(x)$ of the function

$$f_x(Q) = \min_{q \in Q} d_\phi(x, q)$$

is bounded for all $x \in X$ by the function

$$s(x) = \frac{4}{\mu} \left[ \frac{\alpha\, d_A(x, b_x)}{2 c_B} + \frac{\alpha \sum_{x' \in X_x} d_A(x', b_x)}{|X_x|\, c_B} + \frac{n}{|X_x|} \right]$$

where $c_B = \frac{1}{n} \sum_{x' \in X} d_A(x', B)$. Furthermore,

$$S = \frac{1}{n} \sum_{x \in X} s(x) = \frac{6\alpha + 4\beta}{\mu}.$$


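Proof

We consider an arbitrary point $x \in X$ and an arbitrary query $Q \in \mathcal{Q}$ and define

$$c_Q = \frac{1}{n} \sum_{x' \in X} d_A(x', Q) \quad \text{and} \quad c_B = \frac{1}{n} \sum_{x' \in X} d_A(x', B).$$

Since $d_\phi$ is $\mu$-similar, we have

$$f_x(Q) = d_\phi(x, Q) \le d_A(x, Q)$$

as well as $\bar{f}_Q \ge \mu c_Q$. This implies

$$\frac{f_x(Q)}{\bar{f}_Q} \le \frac{1}{\mu} \frac{d_A(x, Q)}{c_Q}. \quad (4)$$

By the double triangle inequality, we have that

$$d_A(x, Q) \le 2 d_A(x, b_x) + 2 d_A(b_x, Q)$$

which in combination with (4) implies

$$\frac{f_x(Q)}{\bar{f}_Q} \le \frac{2}{\mu} \frac{d_A(x, b_x) + d_A(b_x, Q)}{c_Q}. \quad (5)$$

Similarly, we have for all $x' \in X_x$

$$d_A(b_x, Q) \le 2 d_A(x', b_x) + 2 d_A(x', Q)$$

and thus

$$d_A(b_x, Q) \le \frac{2}{|X_x|} \sum_{x' \in X_x} \left[ d_A(x', b_x) + d_A(x', Q) \right].$$

Together with (5), this allows us to bound $\frac{f_x(Q)}{\bar{f}_Q}$ by

$$\frac{4}{\mu} \left[ \frac{d_A(x, b_x)}{2 c_Q} + \frac{\sum_{x' \in X_x} \left[ d_A(x', b_x) + d_A(x', Q) \right]}{|X_x|\, c_Q} \right].$$

By definition, we have both $c_Q \ge \frac{1}{\alpha} c_B$ and $c_Q \ge \frac{1}{n} \sum_{x' \in X_x} d_A(x', Q)$. This implies that

$$s(x) = \frac{4}{\mu} \left[ \frac{\alpha\, d_A(x, b_x)}{2 c_B} + \frac{\alpha \sum_{x' \in X_x} d_A(x', b_x)}{|X_x|\, c_B} + \frac{n}{|X_x|} \right]$$

is a bound for the sensitivity $\sigma_{\mathcal{Q}}(x)$ since the choice of both $x \in X$ and $Q \in \mathcal{Q}$ was arbitrary.

Using the definition of $c_B$, we further have

$$S = \frac{1}{n} \sum_{x \in X} s(x) = \frac{1}{\mu} \left( 2\alpha + \frac{4}{n} \sum_{x \in X} \left[ \frac{\alpha \sum_{x' \in X_x} d_A(x', b_x)}{|X_x|\, c_B} + \frac{n}{|X_x|} \right] \right) = \frac{6\alpha + 4\beta}{\mu}$$

which concludes the proof. ■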
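The closed form for $S$ can be verified numerically, since the three terms of $s(x)$ average to $\alpha/2$, $\alpha$ and $\beta$ respectively. The sketch below assumes $A = I$ (squared Euclidean distance), $\mu = 1$, and centers drawn from the data so that every cluster is non-empty; $\alpha$ is used only as the constant appearing in the formula, and all numeric values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, beta = 300, 2, 4
alpha, mu = 2.0, 1.0    # illustrative constants (assumptions)

X = rng.normal(size=(n, d))
B = X[rng.choice(n, size=beta, replace=False)]   # centers taken from the data

# Squared Euclidean distances (A = I) from every point to every center.
dist = ((X[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
assign = dist.argmin(axis=1)     # index of b_x for each x
d_xb = dist.min(axis=1)          # d_A(x, b_x) = d_A(x, B)
c_B = d_xb.mean()

cluster_size = np.bincount(assign, minlength=beta)               # |X_x|
cluster_cost = np.bincount(assign, weights=d_xb, minlength=beta)  # sum over X_x

s = (4.0 / mu) * (
    alpha * d_xb / (2.0 * c_B)
    + alpha * cluster_cost[assign] / (cluster_size[assign] * c_B)
    + n / cluster_size[assign]
)
S = s.mean()
print(S, (6 * alpha + 4 * beta) / mu)
```

The two printed numbers coincide up to rounding. In an importance-sampling scheme one would sample points with probability proportional to $s(x)$, which is why a small value of $S$ translates into a small coreset.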


C.2 Pseudo-dimension 

Theorem 6 Let $k \in \mathbb{N}$. Let $d_\phi$ be a Bregman divergence on domain $\mathcal{K} \subseteq \mathbb{R}^d$ and $X$ a finite set of points in $\mathcal{K}$. Define the set $F = \{f_x(Q) \mid x \in X\}$ from $\mathcal{K}^k$ to $[0, \infty)$ where $f_x(Q) = \min_{q \in Q} d_\phi(x, q)$. Then, it holds that $\dim(F) = (d + 2)k$.

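Proof Consider an arbitrary subset $S \subseteq F$. We need to show

$$|\{S \cap R \mid R \in \mathrm{ranges}(F)\}| \le (|S| + 1)^{(d+2)k}$$

which holds if

$$|\{\mathrm{range}(S, Q, r) \mid Q \in \mathbb{R}^{d \times k}, r \ge 0\}| \le (|S| + 1)^{(d+2)k}$$

where $\mathrm{range}(S, Q, r) = \{f_x \in S \mid f_x(Q) \le r\}$.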

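We first show the result for $k = 1$. For arbitrary $q \in \mathbb{R}^d$ and $r \ge 0$, we have

$$\{f_x \in S \mid f_x(\{q\}) \le r\} = \{f_x \in S \mid d_\phi(x, q) \le r\} = \{f_x \in S \mid \phi(x) - \phi(q) - \langle x - q, \nabla\phi(q) \rangle \le r\}.$$

As in Nielsen et al. (2007), we define the lifting map $\hat{x} = [x, \phi(x)]$ and $\hat{q} = [q, \phi(q)]$. Furthermore, we set $s = [-\nabla\phi(q), 1]$ and $t = r + \langle \hat{q}, s \rangle$. We then have

$$\{f_x \in S \mid \phi(x) - \phi(q) - \langle x - q, \nabla\phi(q) \rangle \le r\}$$
$$= \{f_x \in S \mid \langle \phi(x) - \phi(q), 1 \rangle + \langle x - q, -\nabla\phi(q) \rangle \le r\}$$
$$= \{f_x \in S \mid \langle \hat{x} - \hat{q}, s \rangle \le r\}$$
$$= \{f_x \in S \mid \langle \hat{x}, s \rangle \le t\}.$$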
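The correspondence between Bregman balls and halfspaces in the lifted space can be illustrated numerically. The sketch below assumes the generator $\phi(x) = \sum_i x_i \ln x_i$ (generalized KL divergence) on positive vectors; the radius, the sample range and the helper names are illustrative assumptions.

```python
import numpy as np

def phi(x):
    """Generator of the generalized KL divergence (assumed example)."""
    return np.sum(x * np.log(x))

def grad_phi(x):
    return np.log(x) + 1.0

def bregman(x, q):
    return phi(x) - phi(q) - (x - q) @ grad_phi(q)

rng = np.random.default_rng(2)
d, r = 3, 0.2
q = rng.uniform(0.1, 2.0, size=d)

# Lifted center, halfspace normal s and offset t as defined in the proof.
q_hat = np.append(q, phi(q))
s = np.append(-grad_phi(q), 1.0)
t = r + q_hat @ s

points = rng.uniform(0.1, 2.0, size=(1000, d))
gap_ball = np.array([bregman(x, q) - r for x in points])
gap_half = np.array([np.append(x, phi(x)) @ s - t for x in points])

# A point lies in the Bregman ball iff its lifted image lies in the halfspace.
print(np.sum(gap_ball <= 0), np.sum(gap_half <= 0))
```

The quantities `gap_ball` and `gap_half` are the same function of $x$ written in two ways, so the two membership counts agree.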





For every Bregman ball defined by $q$ and $r$, there is hence a corresponding halfspace in the lifted $(d + 1)$-dimensional space. We may obtain a bound on the pseudo-dimension of Bregman balls by bounding the pseudo-dimension of halfspaces. Using Theorem 3.1 of Anthony and Bartlett (2009), we have

$$|\{\mathrm{range}(S, \{q\}, r) \mid q \in \mathbb{R}^d, r \ge 0\}| \le (|S| + 1)^{d+2}$$

which shows the claim for $k = 1$.

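We now extend the result to $k \in \mathbb{N}$ centers. For arbitrary $Q \in \mathbb{R}^{d \times k}$ and $r \ge 0$, we have

$$\mathrm{range}(S, Q, r) = \{f_x \in S \mid f_x(Q) \le r\} = \left\{f_x \in S \mid \min_{q \in Q} d_\phi(x, q) \le r\right\} = \bigcup_{q \in Q} \{f_x \in S \mid d_\phi(x, q) \le r\}.$$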


Hence,

$$|\{\mathrm{range}(S, Q, r) \mid Q \in \mathbb{R}^{d \times k}, r \ge 0\}| \le |\{\mathrm{range}(S, \{q\}, r) \mid q \in \mathbb{R}^d, r \ge 0\}|^k \le (|S| + 1)^{(d+2)k}$$

which concludes the proof since the choice of $S$ was arbitrary. ■


C.3 Proof of Theorem 1

Theorem 1 Let $\epsilon \in (0, 1/4)$, $\delta > 0$ and $k \in \mathbb{N}$. Let $d_\phi$ be a $\mu$-similar Bregman divergence on domain $\mathcal{K}$ and denote by $d_A$ the corresponding squared Mahalanobis distance. Let $X$ be a set of points in $\mathcal{K}$ and let $B \subset X$ be the set with the smallest quantization error in terms of $d_A$ among $O(\log \frac{1}{\delta})$ runs of Algorithm 2. Let $C$ be the output of Algorithm 3 with coreset size

$$m = O\left( \frac{d k^3 + k^2 \log \frac{1}{\delta}}{\mu^2 \epsilon^2} \right).$$

Then, with probability at least $1 - \delta$, the set $C$ is an $(\epsilon, k)$-coreset of $X$ for hard clustering with $d_\phi$.

Proof Apply Theorem 5, Lemma 2 and Theorem 6 to Theorem 4. The results can be extended to hold with arbitrary probability $1 - \delta$ by Theorem 4.4 of Feldman and Langberg (2011). ■


C.4 Proof of Lemma 1

Lemma 1 Let $\epsilon \in (0, 1)$ and let $d_\phi$ be a $\mu$-similar Bregman divergence on domain $\mathcal{K}$. Let $X \subset \mathcal{K}$ be a data set, $k \in \mathbb{N}$ and $C$ be an $(\epsilon/3, k)$-coreset of $X$ for hard clustering with $d_\phi$. Let $Q^*_X$ and $Q^*_C$ denote the optimal set of cluster centers for $X$ and $C$ respectively. Then,

$$\mathrm{cost}_H(X, Q^*_C) \le (1 + \epsilon)\, \mathrm{cost}_H(X, Q^*_X).$$

Proof By the coreset property, we have

$$\mathrm{cost}_H(X, Q^*_C) \le \frac{1}{1 - \epsilon/3}\, \mathrm{cost}_H(C, Q^*_C) \le \frac{1}{1 - \epsilon/3}\, \mathrm{cost}_H(C, Q^*_X) \le \frac{1 + \epsilon/3}{1 - \epsilon/3}\, \mathrm{cost}_H(X, Q^*_X) \le (1 + \epsilon)\, \mathrm{cost}_H(X, Q^*_X).$$ ■
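The last inequality in the proof of Lemma 1 relies on an elementary bound that is not spelled out. A short derivation, assuming only $\epsilon \in (0, 1)$ and an $(\epsilon/3, k)$-coreset as in the lemma statement, is:

```latex
\begin{align*}
(1+\epsilon)\left(1-\tfrac{\epsilon}{3}\right) - \left(1+\tfrac{\epsilon}{3}\right)
  &= 1 + \tfrac{2\epsilon}{3} - \tfrac{\epsilon^2}{3} - 1 - \tfrac{\epsilon}{3} \\
  &= \tfrac{\epsilon}{3}\,(1-\epsilon) \;\ge\; 0
  \qquad \text{for } \epsilon \in (0,1), \\
\text{hence}\quad \frac{1+\epsilon/3}{1-\epsilon/3} &\le 1+\epsilon .
\end{align*}
```

Dividing by $1 - \epsilon/3 > 0$ preserves the direction of the inequality, which is why the coreset only needs accuracy $\epsilon/3$ to yield a $(1 + \epsilon)$-approximation.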


D SOFT CLUSTERING 

D.l Sensitivity 

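Lemma 3 Let $d_\phi$ be a Bregman divergence that is $\mu$-similar on the set $\mathcal{K}$, and let $d_A$ denote the corresponding squared Mahalanobis distance. Denote by $X$ a finite set of points in $\mathcal{K}$, let $\Delta^k$ be the $k$-simplex and define $\mathcal{Q} = \Delta^k \times \mathcal{K}^k$. For $x \in X$ and $Q = [w_1, \ldots, w_k, \theta_1, \ldots, \theta_k] \in \mathcal{Q}$ define

$$f_\phi(x \mid Q) = -\ln\left( \sum_j w_j \exp(-d_\phi(x, \theta_j)) \right)$$
$$f_A(x \mid Q) = -\ln\left( \sum_j w_j \exp(-d_A(x, \theta_j)) \right)$$

Then, for all $x \in X$ and $Q \in \mathcal{Q}$:

i) $f_\phi(x \mid Q) \ge 0$.

ii) $f_\phi(x \mid Q) \ge d_\phi(x, Q)$.

iii) $\mu f_A(x \mid Q) \le f_\phi(x \mid Q) \le f_A(x \mid Q)$.

iv) $f_A(x \mid Q) \le 2 d_A(x, b) + 2 f_A(b \mid Q)$.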
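The four properties can be sanity-checked numerically on a toy instance. The sketch below assumes $A = I$ and the divergence $d_\phi = \mu \|x - \theta\|^2$, which is a simple (assumed) example of a $\mu$-similar Bregman divergence; the helper `soft_cost` and all numeric values are hypothetical.

```python
import numpy as np

def soft_cost(dists, w):
    """-ln(sum_j w_j exp(-dists_j)), computed with a log-sum-exp shift."""
    a = np.log(w) - dists
    m = a.max()
    return -(m + np.log(np.exp(a - m).sum()))

def d_A(u, v):
    """Squared Mahalanobis distance with A = I (assumption)."""
    return np.sum((u - v) ** 2, axis=-1)

mu = 0.5

def d_phi(u, v):
    """Toy mu-similar divergence: mu * d_A <= d_phi <= d_A."""
    return mu * d_A(u, v)

rng = np.random.default_rng(3)
d, k = 3, 4
w = rng.dirichlet(np.ones(k))        # mixture weights on the simplex
theta = rng.normal(size=(k, d))      # component parameters
x, b = rng.normal(size=d), rng.normal(size=d)

f_phi = soft_cost(d_phi(x, theta), w)
f_A_x = soft_cost(d_A(x, theta), w)
f_A_b = soft_cost(d_A(b, theta), w)

checks = {
    "i": f_phi >= 0,
    "ii": f_phi >= d_phi(x, theta).min(),
    "iii": mu * f_A_x <= f_phi <= f_A_x,
    "iv": f_A_x <= 2 * d_A(x, b) + 2 * f_A_b,
}
print(checks)
```

A passing check on one random instance is of course no proof; it only illustrates what each inequality states before the formal argument.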
Proof

i) For a fixed $x \in X$, consider the discrete random variable $Y$ that takes value $\exp(-d_\phi(x, \theta_j))$ with probability $w_j$, for $1 \le j \le k$, $\sum_{j=1}^k w_j = 1$. Clearly, $\mathbb{E}[Y] \le 1$ which implies

$$f_\phi(x \mid Q) = -\ln(\mathbb{E}[Y]) \ge 0.$$

ii) Let $d_\phi(x, Q) = \min_{\theta_j \in Q} d_\phi(x, \theta_j)$. Since

$$\sum_{j=1}^k w_j \exp(-d_\phi(x, \theta_j)) \le \sum_{j=1}^k w_j \exp(-d_\phi(x, Q)) \le \exp(-d_\phi(x, Q)) \sum_{j=1}^k w_j \le \exp(-d_\phi(x, Q)),$$

it follows that

$$f_\phi(x \mid Q) = -\ln\left( \sum_j w_j \exp(-d_\phi(x, \theta_j)) \right) \ge -\ln(\exp(-d_\phi(x, Q))) = d_\phi(x, Q).$$

iii) From $d_\phi(x, \theta_j) \ge \mu\, d_A(x, \theta_j)$ it follows that

$$\ln\left( \sum_{j=1}^k w_j \exp(-d_\phi(x, \theta_j)) \right) \le \ln\left( \sum_{j=1}^k w_j \exp(-\mu\, d_A(x, \theta_j)) \right) = \ln\left( \sum_{j=1}^k w_j \left[\exp(-d_A(x, \theta_j))\right]^\mu \right)$$
$$\le \ln\left( \left( \sum_{j=1}^k w_j \exp(-d_A(x, \theta_j)) \right)^\mu \right) = \mu \ln\left( \sum_{j=1}^k w_j \exp(-d_A(x, \theta_j)) \right)$$

by Jensen's inequality on $g(x) = x^\mu$ which is concave for $\mu \in (0, 1]$. Hence

$$f_\phi(x \mid Q) \ge -\mu \ln\left( \sum_{j=1}^k w_j \exp(-d_A(x, \theta_j)) \right)$$

which implies

$$\mu f_A(x \mid Q) \le f_\phi(x \mid Q).$$

To prove the other direction note that

$$f_\phi(x \mid Q) = -\ln\left( \sum_{j=1}^k w_j \exp(-d_\phi(x, \theta_j)) \right) \le -\ln\left( \sum_{j=1}^k w_j \exp(-d_A(x, \theta_j)) \right) = f_A(x \mid Q)$$

since $d_\phi(x, \theta_j) \le d_A(x, \theta_j)$ by definition.

iv) By the triangle inequality it holds that

$$-\ln\left( \sum_j w_j \exp(-d_A(x, \theta_j)) \right) \le -\ln\left( \sum_j w_j \exp(-2 d_A(x, b) - 2 d_A(b, \theta_j)) \right)$$
$$= 2 d_A(x, b) - \ln\left( \sum_j w_j \left[\exp(-d_A(b, \theta_j))\right]^2 \right)$$
$$\le 2 d_A(x, b) - 2 \ln\left( \sum_j w_j \exp(-d_A(b, \theta_j)) \right)$$
$$= 2 d_A(x, b) + 2 f_A(b \mid Q),$$

by Jensen's inequality on $g(x) = x^2$. ■


Lemma 4 Let $d_\phi$ be a $\mu$-similar Bregman divergence on domain $\mathcal{K}$ and denote by $d_A$ the corresponding squared Mahalanobis distance. Let $X$ be a set of points in $\mathcal{K}$ and let $B \subset \mathcal{K}$ be an $(\alpha, \beta)$-bicriteria solution with respect to the optimal $k$-clustering with squared Mahalanobis distance $d_A$. For each point $x \in X$, denote by $b_x$ the closest cluster center in $B$ in terms of $d_A$ and by $X_x$ the set of all points $x' \in X$ such that $b_x = b_{x'}$. Then, the sensitivity $\sigma_{\mathcal{Q}}(x)$ of the function

$$f_x(Q) = f_\phi(x \mid Q) = -\ln\left( \sum_j w_j \exp(-d_\phi(x, \theta_j)) \right)$$

is bounded for all $x \in X$ by the function

$$s(x) = \frac{4}{\mu} \left[ \frac{\alpha\, d_A(x, b_x)}{2 c_B} + \frac{\alpha \sum_{x' \in X_x} d_A(x', b_x)}{|X_x|\, c_B} + \frac{n}{|X_x|} \right]$$

where $c_B = \frac{1}{n} \sum_{x' \in X} d_A(x', B)$. Furthermore,

$$S = \frac{1}{n} \sum_{x \in X} s(x) = \frac{1}{\mu}(6\alpha + 4\beta).$$


Proof Consider an arbitrary point $x \in X$ and an arbitrary query $Q \in \mathcal{Q}$ and define

$$c_Q = \frac{1}{n} \sum_{x' \in X} f_A(x' \mid Q) \quad \text{and} \quad c_B = \frac{1}{n} \sum_{x' \in X} d_A(x', B).$$

By Lemma 3 it holds that

$$\bar{f}_Q = \frac{1}{n} \sum_{x' \in X} f_\phi(x' \mid Q) \ge \frac{\mu}{n} \sum_{x' \in X} f_A(x' \mid Q) = \mu c_Q, \quad (6)$$

and for all $x' \in X_x$

$$f_A(b_x \mid Q) \le 2 d_A(x', b_x) + 2 f_A(x' \mid Q).$$

Summing over all $x' \in X_x$ yields

$$f_A(b_x \mid Q) \le \frac{2}{|X_x|} \sum_{x' \in X_x} \left[ d_A(x', b_x) + f_A(x' \mid Q) \right]. \quad (7)$$

From Lemma 3 and (6) we can conclude that

$$\frac{f_x(Q)}{\bar{f}_Q} \le \frac{1}{\mu} \frac{f_A(x \mid Q)}{c_Q} \le \frac{2}{\mu} \left[ \frac{d_A(x, b_x)}{c_Q} + \frac{f_A(b_x \mid Q)}{c_Q} \right]$$

which combined with (7) suffices to bound $\frac{f_x(Q)}{\bar{f}_Q}$ by

$$\frac{4}{\mu} \left[ \frac{d_A(x, b_x)}{2 c_Q} + \frac{\sum_{x' \in X_x} \left[ d_A(x', b_x) + f_A(x' \mid Q) \right]}{c_Q\, |X_x|} \right].$$

By definition of $B$ and Lemma 3 it follows that

$$c_Q = \frac{1}{n} \sum_{x' \in X} f_A(x' \mid Q) \ge \frac{1}{n} \sum_{x' \in X} d_A(x', Q) \ge \frac{1}{\alpha} c_B$$

which implies that

$$s(x) = \frac{4}{\mu} \left[ \frac{\alpha\, d_A(x, b_x)}{2 c_B} + \frac{\alpha \sum_{x' \in X_x} d_A(x', b_x)}{|X_x|\, c_B} + \frac{n}{|X_x|} \right]$$

is a bound for the sensitivity $\sigma_{\mathcal{Q}}(x)$ since the choice of both $x \in X$ and $Q \in \mathcal{Q}$ was arbitrary.

Finally, by substituting $c_B$ in $s(x)$ it follows that

$$S = \frac{1}{n} \sum_{x \in X} s(x) = \frac{1}{\mu} \left( 2\alpha + \frac{4}{n} \sum_{x \in X} \left[ \frac{\alpha \sum_{x' \in X_x} d_A(x', b_x)}{|X_x|\, c_B} + \frac{n}{|X_x|} \right] \right) = \frac{6\alpha + 4\beta}{\mu}$$

which concludes the proof. ■


D.2 Pseudo-dimension

To bound the pseudo-dimension of the function class we will first obtain a solution set components bound. Intuitively, we partition the parameter space into connected components where parameters from the same connected component realize the same dichotomy. Hence, we can upper bound the number of possible dichotomies by upper bounding the number of connected components in the parameter space. More information on this method is available in Anthony and Bartlett (2009) (Chapter 7). We state the necessary Lemmas from Anthony and Bartlett (2009) and Schmitt (2002).

Lemma 5 Let $k$ be a natural number and suppose $\mathcal{G}$ is the class of real-valued functions in the variables $y_1, \ldots, y_d$ satisfying the following conditions: For every $f \in \mathcal{G}$ there exist affine functions $g_1, \ldots, g_k$ in the variables $y_1, \ldots, y_d$ such that $f$ is an affine combination of $y_1, \ldots, y_d$ and $e^{g_1}, \ldots, e^{g_k}$. Then $\mathcal{G}$ has solution set components bound

$$O\left(2^{d^2 k^2}\right).$$

Definition 7 A class of functions $\mathcal{F}$ is closed under addition of constants if for every $c \in \mathbb{R}$ and $f \in \mathcal{F}$ the function $z \mapsto f(z) + c$ is a member of $\mathcal{F}$.

The following Lemma is due to Schmitt (2002) and slightly improves on the result from Anthony and Bartlett (2009).

Lemma 6 Let $\mathcal{F}$ be a class of real-valued functions

$$(y_1, \ldots, y_d, x_1, \ldots, x_n) \mapsto f(y_1, \ldots, y_d, x_1, \ldots, x_n)$$

that is closed under addition of constants and where each function in $\mathcal{F}$ is $C^d$ in the variables $y_1, \ldots, y_d$. If the class

$$\mathcal{G} = \{(y_1, \ldots, y_d) \mapsto f(y_1, \ldots, y_d, s) : f \in \mathcal{F}, s \in \mathbb{R}^n\}$$

has solution set components bound $B$ then for any sets $\{f_1, \ldots, f_k\} \subseteq \mathcal{F}$ and $\{s_1, \ldots, s_m\} \subseteq \mathbb{R}^n$, where $m \ge d/k$, the set $T \subseteq \{0, 1\}^{mk}$ defined as

$$T = \{(\mathrm{sgn}(f_1(a, s_1)), \ldots, \mathrm{sgn}(f_1(a, s_m)), \mathrm{sgn}(f_2(a, s_1)), \ldots, \mathrm{sgn}(f_2(a, s_m)), \ldots, \mathrm{sgn}(f_k(a, s_1)), \ldots, \mathrm{sgn}(f_k(a, s_m))) : a \in \mathbb{R}^d\}$$

satisfies

$$|T| \le B (emk/d)^d.$$

We now prove the following Lemma which is necessary to bound the pseudo-dimension.

Lemma 7 Let $w = (w_0, \ldots, w_k) \in \mathbb{R}^{k+1}$, $y = (y_{11}, \ldots, y_{1d}, \ldots, y_{kd}) \in \mathbb{R}^{kd}$ and $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$. Define

$$f(w, y, x) = w_0 + \sum_{j=1}^k w_j \exp\left( \sum_{i=1}^d y_{ji} x_i \right)$$

and let $\mathcal{F} = \{f(w, y, \cdot) \mid w \in \mathbb{R}^{k+1}, y \in \mathbb{R}^{kd}\}$. Then $\dim(\mathcal{F}) = O(k^4 d^2)$.
Proof Following Schmitt (2002) and Anthony and Bartlett (2009) we partition the functions $f \in \mathcal{F}$ into categories based on whether $w_i = 0$ or $w_i > 0$, $1 \le i \le k$, which results in $2^k$ categories. For each category we introduce new variables $w'_1, \ldots, w'_k$ where

$$w'_i = \begin{cases} \ln w_i & \text{if } w_i > 0 \\ 0 & \text{otherwise} \end{cases}$$

for $1 \le i \le k$. Choose an arbitrary category. Within this category the functions $f$ for an input $x \in \mathbb{R}^d$ can be expressed in the form

$$(w_0, w', \theta) \mapsto w_0 + b_1 e^{w'_1 + \theta_1^T x} + \cdots + b_k e^{w'_k + \theta_k^T x}$$

where

$$b_i = \begin{cases} 0 & \text{if } w_i = 0, \\ 1 & \text{otherwise,} \end{cases} \qquad 1 \le i \le k.$$

Let $\mathcal{S} = \{(x_1, u_1), \ldots, (x_m, u_m)\}$ be an arbitrary set where $x_1, \ldots, x_m$ are input vectors in $\mathbb{R}^d$ and $u_1, \ldots, u_m$ are real numbers. We will estimate the number of dichotomies that are induced on the set $\mathcal{S}$ by functions of the form

$$(x, z) \mapsto \mathrm{sgn}(f(x) - z).$$

Let $T \subseteq \{0, 1\}^{mk}$ contain all dichotomies induced by functions $f$ from a fixed category on the set $\mathcal{S}$,

$$T = \{(\mathrm{sgn}(f_1(a, x_1, u_1)), \ldots, \mathrm{sgn}(f_k(a, x_m, u_m))) : a \in \mathbb{R}^W\}$$

where each $f_i$ is of the form

$$(y, x, z) \mapsto c_0 + y_0 + c_1 e^{y_1 + y_{11} x_{11} + \cdots + y_{1d} x_{1d}} + \cdots + c_k e^{y_k + y_{k1} x_{k1} + \cdots + y_{kd} x_{kd}} - z. \quad (8)$$

The sets of variables $y_i$ and $y_{ij}$ play the role of the function parameters, $x_{ij}$ are the inputs and $z$ is the input variable for $u_1, \ldots, u_m$. Let $\mathcal{F}$ denote the class of the functions arising for real numbers $c_0, c_1, \ldots, c_k$ with $c_0 = 0$ and $c_i \in \{0, 1\}$ for $i = 1, \ldots, k$. We have introduced $c_0$ to make $\mathcal{F}$ closed under addition of constants. Now, for the vectors $x_1, \ldots, x_m$ and real numbers $u_1, \ldots, u_m$ consider the function class

$$\mathcal{G} = \{y \mapsto f(y, x_i, u_i) : f \in \mathcal{F}, i = 1, \ldots, m\}.$$

Upon inspection of equation (8) we can see that every $f \in \mathcal{G}$ has $k$ exponents that are affine functions in $d$ variables. By Lemma 5 the class $\mathcal{G}$ has solution set components bound

$$B = O\left(2^{W^2 k^2}\right).$$

Since $\mathcal{F}$ is closed under addition of constants, we have from Lemma 6 that

$$|T| \le B (emk/W)^W$$

which is by construction an upper bound on the number of dichotomies that are induced on any set of $m$ vectors $\{(x_1, u_1), \ldots, (x_m, u_m)\}$. Since the choice of the category was arbitrary, it follows that

$$|T| \le B (emk/W)^W 2^k,$$

by considering all $2^k$ categories. By definition, shattering $m$ vectors implies that all $2^m$ dichotomies must be induced. Since $T$ contains all the induced dichotomies it must hold that

$$m \le \log B + W \log(emk/W) + k \log 2.$$

Using the fact that $\ln \alpha \le \alpha\beta + \ln(1/\beta) - 1$ for $\alpha, \beta > 0$ with $\alpha = m$ and $\beta = \frac{\ln 2}{2W}$ we obtain

$$W \log m \le \frac{m}{2} + W \log \frac{2W}{e \ln 2},$$

which implies

$$m \le 2 \log B + 2W \log \frac{2k}{\ln 2} + 2k \log 2.$$

Substituting the solution set components bound $B$ we conclude that

$$m = O(W^2 k^2 + Wk \log(Wk) + W \log k) = O(W^2 k^2).$$

The number of parameters $W$ is $O(kd + k) = O(kd)$ which implies $\dim(\mathcal{F}) = O(k^4 d^2)$ as claimed. ■

Finally, we are ready to prove the main result by applying Lemma 7 on a suitable reparametrization of the Exponential family.

Theorem 7 Let $k \in \mathbb{N}$. Let $d_\phi$ be a Bregman divergence on domain $\mathcal{K} \subseteq \mathbb{R}^d$ and $X$ a finite set of points in $\mathcal{K}$. Let $\mathcal{F} = \{f_x(Q) \mid x \in X\}$ where

$$f_x(Q) = f_\phi(x \mid Q) = -\ln\left( \sum_{j=1}^k w_j \exp(-d_\phi(x, \mu_j)) \right).$$


Then, it holds that $\dim(\mathcal{F}) = O(k^4 d^2)$.

Proof We have that $\forall x \in \mathrm{dom}(\phi)$

$$\exp(-d_\phi(x, \mu))\, b_\phi(x) = \exp\left(x^T \theta - \psi(\theta) - g_\psi(x)\right) = \exp(\hat{x}^T \hat{\theta})$$

where $x$ is the sufficient statistic, $\theta$ is the natural parameter, $\mu$ is the corresponding expectation parameter, $\psi(\theta)$ is the cumulant function, $\hat{x} = [x, -1, -1]$, and $\hat{\theta} = [\theta, \psi(\theta), g_\psi(x)]$. Hence,

$$f_\phi(x \mid Q) = -\ln\left( \sum_{j=1}^k w_j \exp(-d_\phi(x, \mu_j)) \right) = -\ln\left( \sum_{j=1}^k w_j \exp(\hat{x}^T \hat{\theta}_j) \right).$$

By Lemma 7 the function inside the natural logarithm has pseudo-dimension of $O(k^4 d^2)$. The upper bound on the pseudo-dimension is preserved under monotonic transformations (Anthony and Bartlett, 2009) which concludes the proof. ■
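To make the reparametrization concrete, consider the special case $\phi(x) = \|x\|^2$, for which $d_\phi(x, \mu) = \|x - \mu\|^2$. This choice is an assumption made for illustration only. Expanding the square shows that the mixture inside the logarithm is a factor depending only on $x$ times a sum of exponentials of linear functions of $x$, which is the form treated in Lemma 7.

```python
import numpy as np

rng = np.random.default_rng(4)
d, k = 3, 4
w = rng.dirichlet(np.ones(k))      # mixture weights
mu = rng.normal(size=(k, d))       # expectation parameters mu_j
x = rng.normal(size=d)

# Mixture written with the Bregman divergence d_phi(x, mu) = ||x - mu||^2.
direct = np.sum(w * np.exp(-np.sum((x - mu) ** 2, axis=1)))

# Same quantity as exp(-||x||^2) * sum_j w'_j exp(<y_j, x>),
# with y_j = 2 mu_j and w'_j = w_j exp(-||mu_j||^2).
y = 2.0 * mu
w_prime = w * np.exp(-np.sum(mu ** 2, axis=1))
reparam = np.exp(-x @ x) * np.sum(w_prime * np.exp(y @ x))

print(direct, reparam)
```

The factor $\exp(-\|x\|^2)$ does not depend on the query, so only the sum of exponentials contributes parameters, in line with the parameter count $W = O(kd)$ used in Lemma 7.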


D.3 Proof of Theorem 3 


Theorem 3 Let $\epsilon \in (0, 1/4)$, $\delta > 0$ and $k \in \mathbb{N}$. Let $d_\phi$ be a $\mu$-similar Bregman divergence on domain $\mathcal{K}$, and denote by $d_A$ the corresponding squared Mahalanobis distance. Denote by $X$ a set of points in $\mathcal{K}$. Let $C$ be the output of Algorithm 3 with coreset size

$$m = O\left( \frac{d^2 k^4 + k^2 \log \frac{1}{\delta}}{\mu^2 \epsilon^2} \right).$$

Then, with probability at least $1 - \delta$, the set $C$ is an $(\epsilon, k)$-coreset of $X$ for soft clustering with $d_\phi$.


Proof Apply Theorem 5, Lemma 4 and Lemma 7 to Theorem 4. The results can be extended to hold with arbitrary probability $1 - \delta$ by Theorem 4.4 of Feldman and Langberg (2011). ■



